#!pip install pyLDAvis
#!pip install wordcloud
from collections import Counter
import gensim
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
from nltk.corpus import stopwords
from nltk.corpus import movie_reviews
import numpy as np
import os
import pandas as pd
import pyLDAvis.sklearn
import re
import seaborn as sns
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.decomposition import LatentDirichletAllocation, PCA
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.utils import shuffle
from sklearn.manifold import TSNE
import spacy
from wordcloud import WordCloud
import warnings
"ignore", category=DeprecationWarning) warnings.filterwarnings(
Youth Perception of Sexual Education in India
For my project, I have chosen to analyze data from a survey I conducted in May 2020 with the goal of discerning what the youth of India thinks about the Indian Sexual Education curriculum. While India continues to make phenomenal progress in many global fields, there is one particular area in which it still lacks direction and growth - Education, particularly Sexual Education (Sex Ed). While countries like the USA have state-specific laws that target the implementation of Sex Ed in schools, until very recently India was still debating the merits of including it in the curriculum at all. A recent Instagram scandal in May 2020, termed the ‘Bois Locker Room’, made headlines when a group of adolescent boys were outed by their classmates for a group chat that perpetuated rape culture, the objectification of women, and the criminal morphing of women’s private photographs. While there are many nuances to this issue, it brought to the forefront the need to destigmatize sexual culture amongst teenagers and facilitate dialogue within the education system on topics that are still considered taboo - topics such as safe sex, gender identity, attraction, consent etc. - from more than a paltry biological point of view.
The ‘Bois Locker Room’ event made me feel complicit in a system that deliberately denies students the tools to make well-adjusted choices about their lives and instead polices them morally, resulting in disastrous consequences. I’ve always thought that inculcating Sex Ed topics into a supportive and fact-based school curriculum would prevent the hazard of misinformation that teenagers have to sift through every day. A study conducted by the Indian Ministry of Women and Child Development in conjunction with UNICEF in 2007 showed that 53% of children in India faced sexual abuse of some kind and that a majority of those cases went unreported. Academia has shown time and again that Sex Ed plays a major role in sexual violence prevention. Drawing from that, it is also my opinion that the de-stigmatization and normalization of sexual topics will lead to better-adjusted adolescents who understand concepts of consent and safe sex, and will eventually lead to fewer sexual harassment cases and crimes.
Survey Parameters
The survey was conducted via Google Forms and was anonymous. It was restricted to Indians in the age group of 16-28. Any Board of educational curricula (International Baccalaureate, ICSE, IGCSE etc.) was accepted as long as the individual completed high school in India. The main question I ask in this survey is the following:
_“What do the youth perceive sexual education in India as?”_
I picked this question given that the main goal of the research is to present it to Education Technology or Educational Reform institutions in order to help curate a curriculum that caters to what the youth wants. I believe that the survey will indicate their dissatisfaction with the current curriculum. It’s an ambitious question, and to help break it down, I have organized the notebook into sections that correspond to different questions. Currently, the survey has 112 responses.
Process
1. Demographic Survey Data
- Analyze and visualize the demographic distribution of the participants in order to contextualize findings and understand potential biases in the data set. For instance, is the sample representative of my target population of the youth of India?
2. Perceptual and Curricular Data
- Out of a list of “taboo” topics directly or indirectly connected to Sex Ed, which topics were considered the most important by the participants?
- How often were these topics addressed by their school?
- Have the participants experienced sexual policing by an authority figure in their schooling environment?
Tools: For both (1) and (2), I used the pandas library for data manipulation and matplotlib and seaborn libraries for visualizing the results.
3. Textual Analysis of Free Response Data
Extract topics from the answers to the free response question in the survey.
Workflow:
- Text preprocessing: removing punctuation and stopwords, and tokenizing.
- Analyzing word frequencies and visualizing them using a word cloud and bar plots.
- Using the scikit-learn library (TfidfVectorizer) to create the Document-Term Matrix
- Fitting it to an LDA model
- Visualizing extracted topics using pyLDAvis
- Analyzing the contextual similarity of the words using a Word2Vec model
- Interpreting that using a combination of Principal Component Analysis (PCA) and T-distributed stochastic neighbor embedding (TSNE).
# Where am I?
%pwd
'/Users/anushasubramanian/Desktop/Project101'
survey = pd.read_csv("Education_survey.csv")
del survey["Add Comments"]
DEMOGRAPHIC SURVEY DATA
This section aims to understand the distribution of participants and to put the data in the context of accessible services, belief systems, and how representative the sample is. This is particularly relevant in a country such as India, where pockets of rurality and urbanism co-exist.
Age & High School City of Participants
Age: Current Age
High School City: The city participants graduated high school from.
#extract unique age data from survey
age = dict(survey["Age"].value_counts())
age_df = pd.DataFrame()
age_df["age"] = age.keys()
age_df["frequency"] = age.values()

#extract and clean location data
city = dict(survey["High School City"].value_counts())
city["Dehradun"] = city["Dehradun "] + city["Calcutta, Dehradun "] + city["Doon"]
del city["Dehradun "], city["Calcutta, Dehradun "], city["Doon"]
city_df = pd.DataFrame()
city_df["City"] = city.keys()
city_df["Frequency"] = city.values()
f, axes = plt.subplots(1, 2, figsize = (18, 6.27))

#plot age data
#sns.set(rc={'figure.figsize':(10,6.27)})
age_barplot = sns.barplot(x = "frequency",
                          y = "age",
                          data = age_df,
                          orient = "h", palette = 'viridis', ax = axes[0])
age_barplot.set(xlabel = "Number of Participants", ylabel = "Age", title = "Age Distribution of Participants in Survey");

#plot location data
sns.set(rc={'figure.figsize':(10,6.27)})
city_barplot = sns.barplot(x = "City",
                           y = "Frequency",
                           data = city_df,
                           orient = "v", palette = 'Set2', ax = axes[1])
city_barplot.set(xlabel = "City", ylabel = "Number of Participants", title = "City in which Participants finished High School");

sns.despine(left=True)
A large proportion of our participants are in the 19-20 age group, which means that they very recently graduated from high school. Setting it in context, this was also approximately the age group that was involved in the Bois Locker Room scandal.
An overwhelmingly large proportion of participants completed their schooling in Mumbai - a metropolitan city. Since I am from Mumbai, it is possible that the initial dissemination of the survey was largely to a Mumbai-based population who then propagated it to their own Mumbai-based networks. I will discuss the limitations of this at the end of this project.
Gender Identity of Participants
Self-report measure of what gender participants identify as.
#extract gender data
gender_df = pd.DataFrame(index = ['Female','Male','Gender Variant/Non-Conforming', 'Prefer not to say'])
freq = list(survey["Gender Identity"].value_counts())
gender_df["percentage"] = freq

#plot gender pie chart
gender_pie = gender_df.plot.pie(y = "percentage", figsize = (9,9), autopct = '%1.1f%%',
                                colors = ["lightblue","lightsteelblue", "lightpink", "lightyellow"], legend = False);
plt.title("Gender Identity of Participants");
India is still fairly conservative when it comes to recognizing non-binary gender identities. This could account for the very low proportion of such individuals in the sample.
More than half the sample of participants identified as female. This could be due to dissemination in gender-skewed networks (I sent the survey out to more females, and they in turn sent it to their own female-oriented networks, and so on), the tendency of particular genders to respond more actively to online surveys than others, or even the cultural need to conform to the binary (individuals who would categorize themselves as non-binary in private may have been hesitant to do so on a survey).
Spectrum of Religious Affiliations
# Pie chart
religion = dict(survey["Religious Affiliation "].value_counts())

#Cleaning up repeat data in the survey
religion["Undecided"] = religion["Still in the process of understanding concept of religion and god"] + religion["No clue "]
del religion["Still in the process of understanding concept of religion and god"], religion["No clue "]

#define variables for the pie-chart
sizes = list(religion.values())
labels = ["Agnostic","Religious","Spiritual","Atheist","Undecided","Prefer not to Say"]

#colors
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99','lightpink',"gold"]

fig1, ax1 = plt.subplots()
ax1.pie(sizes, colors = colors, labels=labels, autopct='%1.1f%%')

#draw circle
centre_circle = plt.Circle((0,0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

# Equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal')
plt.title("Religious Affiliations of Participants")
plt.tight_layout()
plt.show()
The youth sampled in this survey can be considered representative of the various religious modes of thought prevalent in India. However, I would be careful not to assume that they are representative of religions themselves, since I didn’t ask about particular religious ties such as Hinduism or Islam.
PERCEPTUAL AND CURRICULAR SURVEY DATA
Students were asked to answer various forced-choice or rating questions regarding a list of topics relevant directly or indirectly to sexual education. The topics were descriptive because it is near impossible to encapsulate the range of concepts like “mental health” or “sexual health” under one umbrella term. However, to increase ease of understanding and readability of code, the descriptive topics have been shortened. A table of reference is provided below.
Shortened Form | Description |
---|---|
svsh | Sexual/Domestic Violence and Identifying Workplace Harassment |
mental_health | Comprehensive Mental Health Awareness, Assessments & Resources |
bullying | Tackling Bullying & Identifying Toxic Environments |
sex_ed | Comprehensive Sex Education beyond basic reproduction (safe sex, consent, pleasure, desire) |
gender_identity | Gender Identity & Sexuality |
menstruation | Normalization of Menstruation |
cyber_crime | Comprehensive understanding of Cyber Etiquettes, Crimes, Laws |
Which topics were rated the highest in importance by the participants?
Participants were asked to rate the importance of including the topics listed above in their school curriculum on a 0-5 point scale. I summed the individual scores obtained by each topic. The graph depicts the cumulative score out of a maximum of 560 (112 responses × a maximum rating of 5 each).
#sum the columns
ratings = pd.DataFrame()
topics = ["svsh","mental_health","bullying","sex_ed","gender_identity","menstruation","cyber_crime"]
freq = [survey["Imp SVSH"].sum(), survey["Imp Mental Health"].sum(), survey["Imp Bullying"].sum(), survey["Imp Sex Ed"].sum(),
        survey["Imp Gender Identity "].sum(), survey["Imp Menstruation"].sum(), survey["Imp Cyber Crime"].sum()]
ratings["topics"] = topics
ratings["score"] = freq
ratings.sort_values(by = 'score', inplace = True, ascending = False)
#plot barplot
sns.set(rc={'figure.figsize':(12,6.27)})
imp_barplot = sns.barplot(x = "topics", y = "score", data = ratings, palette = 'rocket', orient = "v");
imp_barplot.set(xlabel = "Topics", ylabel = "Score out of 560",
                title = "Cumulative Rated Importance of Topics in Curriculum");
for p in imp_barplot.patches:
    imp_barplot.annotate(format(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()),
                         ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')
I expected more variation than this graph depicts. Mental Health and Bullying are the two most important topics according to the participants. However, looking at how close in score the other topics are, this is not surprising: it is possible that a vast majority of teenage/adolescent mental health problems and incidents of bullying are manifestations of issues that arise from improper education and understanding of the topics that rank lower on the graph. Something that struck me as particularly interesting was that Cyber Crime ranked last on the list, even though the survey was conducted in the middle of an episode of national outrage regarding a cyber space scandal.
According to the participants, how often were these topics addressed in their school?
Participants had to rate, on a modified Likert scale, how often the same topics were addressed in their schools. Oftentimes, Indian officials tout the excuse that the “public is not ready” or the “public doesn’t want it” as a way of brushing shortcomings aside. This question, along with the previous one, was incorporated to understand whether there exists a discrepancy between the desire of youth to gain information on taboo topics and the frequency with which such information is disseminated in educational institutions.
Scale 1. Never 2. Rarely 3. Occasionally 4. Frequently 5. Very Frequently
= ["SVSH", "Mental Health", "Bullying", "Sex Ed","Gender Identity", "Menstruation", "Cyber Crime"]
topics = ["Never","Rarely","Occasionally","Frequently","Very Frequently"]
metric
#initialise DataFrame
= pd.DataFrame(index = topics)
matrix_df
= [],[],[],[],[]
never,rarely, occasionally, freq, vfreq
#column numbers we want to extract from survey
= [20,21,22,23,24,25,28]
columns
#our data is not tidy. This loop is to create a frequency matrix of the form topicx x metrics
for i in columns:
= dict(survey[survey.columns[i]].value_counts())
count for j in metric:
if j not in count.keys():
= 0
count[j] "Never"])
never.append(count["Rarely"])
rarely.append(count["Occasionally"])
occasionally.append(count["Frequently"])
freq.append(count["Very Frequently"])
vfreq.append(count[
# add data to your DataFrame
"Never"] = never
matrix_df["Rarely"] = rarely
matrix_df["Occasionally"] = occasionally
matrix_df["Frequently"] = freq
matrix_df["Very Frequently"] = vfreq
matrix_df[= "Very Frequently", inplace = True, ascending = True)
matrix_df.sort_values(by
matrix_df
sns.heatmap(matrix_df, cmap = 'RdPu', annot=False);
plt.title("How frequently are the Topics addressed in School");
There’s almost an even split in the heatmap, with the right side dominated by light tones (corresponding to a lower number of responses) while the left side is much darker (a higher number of responses). Using the given scale to interpret the map, it is evident that very few participants felt that any of these topics were addressed frequently or very frequently.
Topic | Most Common Response | Least Common Response |
---|---|---|
SVSH | Never | Very Frequently |
Gender Identity | Never | Very Frequently |
Bullying | Occasionally | Very Frequently |
Sex Ed | Never | Very Frequently |
Mental Health | Rarely | Very Frequently |
Menstruation | Never | Very Frequently |
Cyber Crime | Never | Very Frequently |
The table above shows that the participants felt that a majority of the topics in this survey were “Never” addressed in their school environment. It is concerning on many levels to hear that basic health conversations such as Menstruation were also never addressed, even in an educational sense.
76 participants felt that Gender Identity was Never addressed in their school - the highest single count in the heatmap. This makes sense given the Indian context. While the Indian landscape accepts the idea of sexual relations, albeit biologically, a large proportion of the country still has trouble with the idea of non-binary gender identities. And since education systems and curricula tend to run on a lag, not updating textbooks and details until decades after social movements, educational resources about this topic are nearly non-existent in this conservative system.
TEXTUAL ANALYSIS ON FREE RESPONSE ANSWERS
The aim of this segment of data analysis is to portray whether the youth of India is satisfied or dissatisfied with the way the topics in this survey are addressed in their schools. It’s meant to encapsulate all of the above perceptual questions and act as a platform for participants to share their diverse experiences in the Indian education system.
Q: ‘How do you think these topics were approached in your school environment?’
The free responses that I am analyzing in this section are answers to the question above. I believe that the topics extracted will indicate dissatisfaction and discontent. Although I did see a few positive responses in the survey, it is my opinion that the majority opinion will be one of discontent.
The data, which was downloaded as a .csv file from Google Forms, had to be manually cleaned up before I could begin the text preprocessing. There were a lot of unintended white spaces, typographic errors, abbreviations, and incorrectly formatted digits that I had to correct prior to computationally cleaning up the data. It made me realize that keeping the end goals, and the methods with which you will analyze your data, at the back of your mind will always help you design better-suited data collection surveys, interviews etc. Had I done that, the cleaner data I would have obtained would have saved me a lot of time in the preprocessing steps.
Text Preprocessing
= list(survey["Free Response"])
response 0:2] response[
['In India (Mumbai) it was introduced very briefly and done in a manner that boys and girls had separate trainings. This sort of defeated the point of normalizing sex related topics as we were not allowed to discuss it freely. In LA, these topics were more normalized (especially via mandatory trainings and online courses that had to be fulfilled prior to beginning freshman year courses).',
'Being in an all-boys school, I think they took this pretty casually as it was so commonplace.']
Removing Punctuations
#remove punctuations and save in new list
from string import punctuation
text_processed = []
for r in response:
    for char in punctuation:
        r = r.lower().replace(char, "")
    text_processed.append(r)
text_processed[0:2]
['in india mumbai it was introduced very briefly and done in a manner that boys and girls had separate trainings this sort of defeated the point of normalizing sex related topics as we were not allowed to discuss it freely in la these topics were more normalized especially via mandatory trainings and online courses that had to be fulfilled prior to beginning freshman year courses',
'being in an allboys school i think they took this pretty casually as it was so commonplace']
#add the processed text to our dataframe for use later
survey["Text processed"] = text_processed

# Save the processed text as one long string
long_string = ','.join(text_processed)
#long_string
Tokenization & Removing Stopwords
#Tokenize long string
es_tokens = long_string.split()

#Remove stop words
stop = stopwords.words("english")
no_stop = [word for word in es_tokens if word not in stop]

#define a string without stop words
unique_string = (" ").join(no_stop)

# a copy for later use
copy = no_stop
Short Text Topic Modeling
Learning the Vocabulary
# Define an empty bag (of words)
vectorizer = CountVectorizer()

# Use the .fit method to tokenize the text and learn the vocabulary
vectorizer.fit(survey["Text processed"])

# Print the vocabulary
#vectorizer.vocabulary_
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
Encoding the Documents
vector = vectorizer.transform(survey["Text processed"])
print(vector)
#print(vector.shape)
#print(type(vector))
(0, 25) 1
(0, 31) 3
(0, 42) 1
(0, 64) 1
(0, 68) 1
(0, 83) 1
(0, 86) 1
(0, 157) 2
(0, 174) 1
(0, 191) 1
(0, 200) 1
(0, 233) 1
(0, 281) 1
(0, 282) 1
(0, 284) 1
(0, 295) 1
(0, 313) 2
(0, 355) 3
(0, 364) 1
(0, 381) 1
(0, 386) 2
(0, 399) 1
(0, 426) 1
(0, 427) 1
(0, 449) 1
: :
(110, 480) 1
(110, 485) 1
(110, 490) 1
(110, 509) 1
(110, 587) 1
(110, 614) 1
(110, 672) 1
(110, 713) 1
(110, 733) 2
(110, 757) 1
(110, 792) 2
(110, 799) 1
(110, 817) 1
(111, 2) 1
(111, 265) 1
(111, 283) 1
(111, 311) 1
(111, 313) 1
(111, 405) 1
(111, 650) 1
(111, 725) 2
(111, 727) 1
(111, 733) 1
(111, 785) 1
(111, 798) 1
# View as a multidimensional array before converting to data frame
# Rows are the documents
# Columns are the terms
print(vector.toarray())
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 1 ... 0 0 0]]
# What are the terms?
vectorizer.get_feature_names()
['ability',
'able',
'about',
'above',
'academics',
'accepted',
'accolades',
'acknowledging',
'action',
'actions',
'actually',
'additional',
'additionally',
'addressed',
'addressing',
'adults',
'affects',
'affiliated',
'aforementioned',
'after',
'again',
'against',
'age',
'all',
'allboys',
'allowed',
'alma',
'also',
'always',
'am',
'an',
'and',
'any',
'anyone',
'appointed',
'appreciation',
'approached',
'appropriately',
'are',
'aren',
'arent',
'around',
'as',
'asked',
'aspects',
'assemblies',
'assembly',
'assume',
'at',
'attacked',
'attention',
'authorities',
'authority',
'avoid',
'avoided',
'aware',
'awareness',
'away',
'awfully',
'awkward',
'bad',
'badly',
'barely',
'basic',
'be',
'because',
'been',
'before',
'beginning',
'behavior',
'being',
'believe',
'believed',
'best',
'better',
'between',
'big',
'biology',
'blind',
'board',
'books',
'both',
'boy',
'boys',
'bras',
'brief',
'briefly',
'brilliant',
'broached',
'broadly',
'brushed',
'bullying',
'but',
'buzz',
'by',
'call',
'called',
'came',
'can',
'captain',
'care',
'carmel',
'carpet',
'cases',
'casually',
'certain',
'change',
'changed',
'changing',
'children',
'chose',
'class',
'classes',
'classmates',
'classroom',
'closest',
'clothes',
'coach',
'coed',
'coincidentally',
'college',
'colored',
'come',
'comfortable',
'commendable',
'comments',
'common',
'commonplace',
'compared',
'compulsory',
'concerned',
'conduct',
'conducted',
'conform',
'conforming',
'confronting',
'conscious',
'consciously',
'consensual',
'consequences',
'conservative',
'conservatively',
'consider',
'considered',
'considering',
'controversial',
'conversations',
'convey',
'correction',
'could',
'couldve',
'council',
'councilor',
'counsellor',
'counselor',
'countries',
'course',
'courses',
'cover',
'covered',
'create',
'creates',
'crime',
'crises',
'crop',
'culture',
'curriculum',
'cuz',
'cyber',
'cynical',
'date',
'debate',
'deemed',
'deep',
'defeated',
'definitely',
'degree',
'deliberately',
'destigmatization',
'detached',
'detail',
'details',
'dictating',
'did',
'didn',
'didnt',
'different',
'differing',
'disappointing',
'disciplinary',
'discrimination',
'discuss',
'discussed',
'discussion',
'discussions',
'dissuade',
'do',
'doctor',
'does',
'don',
'done',
'dont',
'down',
'draw',
'dress',
'dresscode',
'drummed',
'dumbed',
'during',
'early',
'education',
'educational',
'effective',
'effort',
'efforts',
'eg',
'eight',
'eighth',
'either',
'elder',
'eleven',
'eleventh',
'else',
'embrace',
'emphasis',
'emphasized',
'empowerment',
'encouraged',
'enough',
'envelope',
'environment',
'equal',
'equality',
'especially',
'etc',
'even',
'ever',
'every',
'everyone',
'everything',
'everywhere',
'example',
'expected',
'experienced',
'experts',
'explained',
'explored',
'exposed',
'expressing',
'extent',
'extremely',
'eye',
'face',
'faced',
'facing',
'fact',
'faculty',
'failed',
'falters',
'far',
'fear',
'feel',
'felt',
'female',
'females',
'few',
'field',
'fields',
'financial',
'firmly',
'five',
'focuses',
'follow',
'followed',
'football',
'for',
'form',
'formality',
'formative',
'frankly',
'freedom',
'freely',
'freshman',
'from',
'fulfilled',
'fully',
'fundamental',
'furthermore',
'games',
'gay',
'gender',
'generally',
'generic',
'giggle',
'girl',
'girls',
'give',
'given',
'go',
'going',
'good',
'gossip',
'got',
'gotten',
'grade',
'grades',
'grounds',
'group',
'growing',
'grown',
'guess',
'guests',
'guys',
'had',
'hairstyles',
'handle',
'happen',
'happened',
'harassment',
'hard',
'has',
'have',
'having',
'he',
'health',
'healthy',
'held',
'help',
'helpful',
'helpless',
'hence',
'hesitate',
'highlighted',
'hint',
'hips',
'history',
'hitting',
'holding',
'home',
'homosexuality',
'hour',
'how',
'however',
'hv',
'hygiene',
'ideas',
'identity',
'if',
'ignored',
'im',
'imp',
'impact',
'importance',
'important',
'impressionable',
'in',
'inability',
'inappropriate',
'incident',
'including',
'inclusion',
'incorporate',
'incorrect',
'indeed',
'india',
'indian',
'industry',
'ineffectively',
'inefficiency',
'information',
'informed',
'insistence',
'insisting',
'institution',
'insufficiently',
'integrate',
'interest',
'interested',
'international',
'intimacy',
'into',
'introduced',
'irrespective',
'is',
'isn',
'issues',
'it',
'its',
'joke',
'jokes',
'judgmental',
'just',
'justify',
'keen',
'keyword',
'kids',
'knee',
'knew',
'know',
'la',
'late',
'later',
'learn',
'learnt',
'least',
'lectures',
'length',
'let',
'level',
'levels',
'liberal',
'liberally',
'life',
'like',
'line',
'listed',
'literally',
'little',
'longed',
'looked',
'looking',
'lot',
'made',
'make',
'making',
'male',
'mandatory',
'manner',
'many',
'mater',
'mature',
'may',
'maybe',
'me',
'members',
'menstrual',
'menstruation',
'mental',
'mentality',
'mention',
'mentioned',
'mentioning',
'mentored',
'merely',
'middle',
'minimal',
'minimize',
'minute',
'moral',
'more',
'moreover',
'most',
'mostly',
'mount',
'moving',
'much',
'mumbai',
'must',
'my',
'myself',
'naively',
'national',
'natural',
'necessary',
'need',
'needed',
'neglected',
'neither',
'never',
'ngo',
'ninth',
'no',
'nonacademic',
'none',
'nor',
'normal',
'normalize',
'normalized',
'normalizing',
'norms',
'not',
'now',
'obey',
'occasionally',
'occasions',
'of',
'offensive',
'often',
'ok',
'old',
'on',
'once',
'one',
'ones',
'online',
'only',
'open',
'opinion',
'opportunity',
'opposed',
'optional',
'or',
'order',
'organize',
'orgasms',
'orthodox',
'other',
'others',
'otherwise',
'our',
'out',
'over',
'overlooked',
'part',
'particular',
'pedestal',
'penalized',
'people',
'period',
'periods',
'person',
'personal',
'personally',
'perspective',
'picked',
'place',
'places',
'planning',
'play',
'played',
'point',
'points',
'policing',
'portion',
'positions',
'power',
'ppt',
'pretty',
'prevent',
'preventive',
'prior',
'problems',
'programs',
'promoting',
'property',
'psych',
'pu',
'punishment',
'put',
'qualifications',
'questioned',
'quite',
'randomized',
'rarely',
'rather',
're',
'reached',
'real',
'realize',
'really',
'receptive',
'redness',
'reflected',
'regulations',
'reinforced',
'related',
'religious',
'remember',
'repercussions',
'respected',
'response',
'responsible',
'rest',
'restrictions',
'resulting',
'right',
'rights',
'roles',
'room',
'rooms',
'rule',
'rules',
'safe',
'said',
'same',
'saw',
'say',
'school',
'schools',
'science',
'scientifically',
'scratched',
'second',
'seemed',
'seen',
'self',
'seminar',
'seminars',
'senior',
'sensitive',
'sent',
'separate',
'separately',
'separating',
'seriously',
'session',
'sessions',
'settings',
'seven',
'severe',
'sex',
'sexed',
'sexes',
'sexism',
'sexual',
'sexuality',
'shaming',
'shirtless',
'short',
'shorter',
'should',
'shouldnt',
'showed',
'showing',
'shy',
'simply',
'situation',
'sixth',
'skillfully',
'skills',
'skirt',
'skirts',
'skirtsshorts',
'sleevelessshort',
'slight',
'small',
'smaller',
'so',
'social',
'sociopolitical',
'solution',
'some',
'somehow',
'something',
'sometimes',
'somewhat',
'sort',
'speak',
'speakers',
'speaking',
'spoke',
'spoken',
'sports',
'staff',
'staircases',
'stance',
'status',
'stds',
'stick',
'stigma',
'stigmatized',
'still',
'stopped',
'straight',
'student',
'students',
'studied',
'study',
'stuff',
'subconsciously',
'subject',
'subjects',
'such',
'suddenly',
'sufficient',
'superficially',
'supported',
'sure',
'surface',
'suspensionexpulsion',
'syllabus',
'system',
'taboo',
'tackle',
'tackled',
'taking',
'talk',
'talked',
'talking',
'talks',
'taught',
'teach',
'teacher',
'teachers',
'teaches',
'teaching',
'team',
'tech',
'teenagers',
'teens',
'ten',
'tend',
'terminology',
'terms',
'textbook',
'than',
'that',
'thats',
'the',
'their',
'them',
'themselves',
'then',
'there',
'these',
'they',
'theyre',
'things',
'think',
'this',
'those',
'though',
'thought',
'through',
'till',
'time',
'times',
'to',
'tokenistically',
'told',
'too',
'took',
'topic',
'topics',
'tops',
'touch',
'touched',
'tough',
'towards',
'training',
'trainings',
'transition',
'tried',
'trust',
'try',
'tshirt',
'turned',
'twelve',
'twenty',
'two',
'unaware',
'uncomfortable',
'under',
'understand',
'understanding',
'understood',
'uneducated',
'unfortunate',
'uniform',
'unique',
'unnecessary',
'unwanted',
'up',
'upheld',
'upon',
'us',
'used',
'usually',
'utmost',
'vague',
'value',
'very',
'via',
'video',
'vocational',
'voicing',
'volunteers',
'walk',
'wanted',
'wary',
'was',
'wasn',
'waste',
'way',
'ways',
'we',
'wear',
'wearing',
'webinars',
'week',
'well',
'went',
'were',
'weren',
'werent',
'what',
'whats',
'when',
'where',
'which',
'while',
'who',
'why',
'willing',
'wish',
'with',
'without',
'women',
'won',
'words',
'wore',
'workplace',
'workshops',
'worst',
'would',
'wrong',
'year',
'yearly',
'years',
'yet',
'you',
'young',
'your']
Extract Bigrams
Bigrams are two adjacent elements from a string of tokens. Bigrams help place words and phrases into context. For instance, in a United Nations document, a model may classify ‘human’ and ‘rights’ as two different tokens and hence provide us with word frequencies for both, separately. However, given the context, word frequency of the bigram ‘human rights’ would be more insightful. This is why extracting bigrams is helpful.
In my project, there weren’t any distinct bigrams as far as I could tell. This could be due to the short nature of the responses and the small size of the data set overall. It was definitely still worth running the data through a bigram analyzer.
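To make the idea concrete, here is a minimal sketch (the sentence is invented for illustration, not survey data) of how an n-gram analyzer built with scikit-learn's CountVectorizer emits both unigrams and bigrams such as 'human rights':
# Toy illustration (invented sentence): a (1,2) n-gram analyzer returns the
# unigrams first, followed by adjacent-word bigrams such as 'human rights'.
toy_analyze = CountVectorizer(ngram_range = (1,2), token_pattern = r'\b\w+\b').build_analyzer()
toy_analyze("the council debated human rights")
# ['the', 'council', 'debated', 'human', 'rights',
#  'the council', 'council debated', 'debated human', 'human rights']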
# What other processing steps could you include here
# ... instead of doing them manually above?
bigram_vectorizer = CountVectorizer(ngram_range = (1,2),
                                    stop_words = "english",
                                    token_pattern = r'\b\w+\b',
                                    min_df = 1)
bigram_vectorizer
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 2), preprocessor=None, stop_words='english',
strip_accents=None, token_pattern='\\b\\w+\\b', tokenizer=None,
vocabulary=None)
# Analyze unique_string in the bigram bag of words
analyze = bigram_vectorizer.build_analyzer()

vocab = analyze(unique_string)
Metrics of Word Frequency
Word frequency analysis can be a double-edged sword. On one hand, word frequencies offer a lot of “big-picture” insight into our data set. Borrowing from the UN example that I used before, looking at the (potentially high) frequency of the bigram ‘human rights’ and interpreting whether other frequent words in the corpus together form a political, social or internationally characterized lexicon would lead us to infer that the corpus consists of documents from UN or a similar organization. However, only looking at ‘human’ or ‘rights’ word frequencies might not lead us to the same conclusion. Thus, in computational approaches to social science, a humanistic understanding of the source and context of your data set is always helpful.
Word Frequency analysis on my dataset could prove helpful because of its small size. I know that this dataset contains participants’ responses to questions regarding topics covered in our education system. From the age demographics, I know that they are primarily in the range of 19-20. This in conjunction with other information helps put certain word frequency results in context and hence, understand them better. I will provide specific examples of where context-driven interpretations are helpful, alongside the visualizations of word-frequency metrics.
# Show the 20 most common words
freq = Counter(vocab)
stop_df = pd.DataFrame(freq.most_common(20), columns = ["Word", "Frequency"])
#the difference is that now the bigrams are aligned
stop_df
 | Word | Frequency |
---|---|---|
0 | topics | 38 |
1 | school | 35 |
2 | girls | 19 |
3 | students | 17 |
4 | think | 13 |
5 | discussed | 12 |
6 | teachers | 12 |
7 | covered | 11 |
8 | t | 11 |
9 | schools | 10 |
10 | boys | 9 |
11 | sex | 9 |
12 | topic | 9 |
13 | rarely | 9 |
14 | approached | 8 |
15 | grade | 8 |
16 | really | 8 |
17 | issues | 8 |
18 | like | 8 |
19 | dont | 8 |
WORDCLOUD
# Define a word cloud variable
cloud = WordCloud(background_color = None,
                  max_words = 20,
                  contour_width = 5,
                  width = 600, height = 300,
                  random_state = 6, colormap = 'inferno', mode = "RGBA")

# Process the word cloud
cloud.generate(unique_string)

# Visualize!
#cloud.to_file('WordCloud.png')
cloud.to_image()
One of the primary advantages of a word cloud as a visualization is that the frequent words jump out at you almost immediately. It’s easy to note the words that take up the most space: ‘topic’, ‘school’, ‘never’, ‘girl’, ‘student’, ‘discussed’ etc. Taken in isolation, they indicate a culture of schools never (or rarely) ‘discussing’, ‘approaching’ or ‘covering’ such topics as those asked about in the question.
An example of the context driven interpretation that I mentioned can be seen in the word ‘grade’. As an Indian and an individual who has gone over their dataset many times, I understand that in this context, ‘grade’ is used to refer to numerical levels of education, such as the 5th grade or the 8th grade. Here, it does not mean ‘grade’ as in to assign a grade such as A, B+ or F.
BARPLOT
sns.set(rc={'figure.figsize':(10,6.27)})
hr_barplot = sns.barplot(x = "Frequency",
                         y = "Word",
                         data = stop_df,
                         orient = "h")
#plt.savefig('Frequency of Words Barplot.png', dpi = 180, bbox_inches='tight', transparent = True)
Here is a different way of visualizing the same data, since looking at it differently sometimes provides different insights. This is a bar plot of the 20 most frequently used words in our responses. They represent the same corpus of words shown in the word cloud.
In isolation, they don’t indicate much. The two most common words, ‘topics’ and ‘school’, are not surprising because they are what the survey question asks about. Almost every participant would have used them as anchors to provide relevant answers to the question. However, when considered in conjunction with our demographic and curricular data, the words gain context and hence make more sense.
For instance, we know that participants felt that a vast majority of topics detailed in the survey were ‘Never’ discussed (heatmap). This adds additional weight to words such as ‘discussed’, ‘approached’, ‘issues’, ‘dont’, ‘covered’, ‘think’. This cluster of frequent words, combined with the high frequency of ‘never’ and ‘rarely’ from this section, and the high rates of ‘Never’ and low rates of ‘Very Frequently’ from the heatmap, spin a narrative centered on the characters of ‘girls’, ‘boys’ and ‘teachers’. If each of those characters independently has a high enough word frequency to make it onto the ‘top 20’ list, it is reasonable to assume that they were addressed as independent entities in the free responses. I’m hesitant to say that ‘girls’ and ‘boys’ being mentioned frequently and separately is indicative of gender disparity when it comes to addressing these topics. Due to the large number of female-identifying participants in the survey, it is possible that the personal experiences of girls were talked about more in the responses, resulting in the higher frequency of ‘girls’.
Either way, I hypothesize that the narrative continues to be one of teachers rarely addressing or approaching topics in the survey that students in school are interested in.
Fitting our Topic Model
Here I’m using term frequency-inverse document frequency (TF-IDF), wherein words that appear across many responses are down-weighted (the inverse document frequency component), to build the Document-Term Matrix (DTM). TF-IDF encodes the text in a way that preserves the relative importance of words, while a Bag of Words model simply creates a matrix based on raw word counts. This is my rationale behind choosing the TfidfVectorizer().
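As a hedged illustration of that difference (the two documents below are invented, not survey responses), raw counts treat every word equally, whereas TF-IDF down-weights a word like 'school' that appears in every document:
# Invented two-document example contrasting raw counts with TF-IDF weights.
toy_docs = ["school never discussed sex", "school covered mental health"]
print(CountVectorizer().fit_transform(toy_docs).toarray())   # every occurrence counts as 1
print(TfidfVectorizer().fit_transform(toy_docs).toarray())   # 'school', shared by both documents, gets a lower weight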
# How many topics?
n_topics = 4
Our dataset is small; therefore, I believe there won’t be a large number of topics to extract. Considering more than 5 resulted in considerably overlapping topics, and even 5 gave me topics that didn’t seem to have an overarching point, so I settled on 4 (a rough way to compare candidate counts is sketched after the LDA fit below).
# TfidfVectorizer to create the DTM
tfidf_vectorizer = TfidfVectorizer(max_df = 0.90,
                                   max_features = 5000,
                                   stop_words = "english")

# Fit
tfidf = tfidf_vectorizer.fit_transform(copy)
# Instantiate our LDA model
lda = LatentDirichletAllocation(n_components = n_topics,
                                max_iter = 20,
                                random_state = 9)
lda = lda.fit(tfidf)
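As a rough sanity check on the number of topics chosen above (not the procedure I actually used, which was manual inspection of the extracted topics), one could refit the model for a few candidate counts and compare perplexity on the same matrix; lower is loosely better:
# Hedged sketch: compare a few candidate topic counts by perplexity on the tfidf matrix.
for k in [3, 4, 5]:
    candidate_lda = LatentDirichletAllocation(n_components = k, max_iter = 20, random_state = 9).fit(tfidf)
    print(k, "topics -> perplexity:", round(candidate_lda.perplexity(tfidf), 2))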
#function to display topics in a formatted manner
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #{}:".format(topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
TOPIC EXTRACTION
# Return the topics
tf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, 20)
Topic #0:
discussed sex approached people health addressed teacher feel actually college sexual institution know team problems related class touched menstruation talk
Topic #1:
topics school schools really education important certain environment sports lot way used level good life introduced crime took obey carpet
Topic #2:
girls students covered boys rarely gender spoken discuss tried mental talking importance talked allowed different curriculum seminars faced separate social
Topic #3:
think teachers topic dont like issues grade things werent discussions taught pretty went deemed weren repercussions surface open considered sessions
VISUALISATION & MULTIDIMENSIONAL SCALING (PCA)
panel = pyLDAvis.sklearn.prepare(lda_model = lda,
                                 dtm = tfidf,
                                 vectorizer = tfidf_vectorizer,
                                 mds = "PCoA")
pyLDAvis.display(panel)
*The topic numbers in the table below are based on the pyLDAvis output and not the printed topics above; they are the same topics, only numbered differently.
TOPICS EXTRACTED
Topic # | Overarching Topic | Relevant Words |
---|---|---|
1 | Treated as Hostile | crime, obey, authority, harassment, taking, right, girl, introduced, important, topics, school, education |
2 | Rare Dialogue on Gender & Mental Health | girls, boys, students, covered, rarely, gender, spoken, discuss, tried, mental, talked, importance, different, curriculum, separate, social, workshops, need, wore, planning |
3 | Taught not Discussed | think, teachers, topic, dont, like, issues, werent, discussions, taught, repercussions, surface, sessions, taboo, didnt, tackle |
4 | Emphasis on Health | discussed, sex, health, approached, teacher, college, sexual, institution, problems, related, touched, menstruation, mentality, discrimination, opinion, rules, given, equality, policing, situation |
Contextual Similarity using Word2Vec
# First, store the documents we want to explore in a separate dataframe with just one column
w2v_df = pd.DataFrame({'Processed': survey["Text processed"]})
#w2v_df

# Turn the text of each row into a list
# We now have a list of lists - one for each document
split_rows = [row.split() for row in w2v_df['Processed']]
#split_rows

no_stop = []
for response in split_rows:
    no_stop.append([word for word in response if word not in stop])
#no_stop
Define Model
There were two possible methods of vectorizing the words: skip-grams or CBOW (Continuous Bag Of Words). The essential difference between the two is that in CBOW, the context (a set of words within a fixed window) is used to predict the word in the middle, while with skip-grams the model uses a word to predict its surrounding context words. I decided to go with CBOW, even though skip-grams tend to work better on smaller amounts of data, because CBOW is better at representing frequent words. Free responses contain the lexicon of everyday speech and are unlikely to contain rare words; therefore, in my opinion, CBOW will give me better accuracy.
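In gensim, switching between the two architectures is just the `sg` flag; the lines below are a sketch mirroring the parameters of the model defined next (the skip-gram variant is shown only for comparison and is not used here):
# Sketch: sg = 0 selects CBOW (used below), sg = 1 would select skip-gram instead.
cbow_sketch = gensim.models.Word2Vec(no_stop, min_count = 2, size = 12, window = 3, workers = 1, sg = 0)
skipgram_sketch = gensim.models.Word2Vec(no_stop, min_count = 2, size = 12, window = 3, workers = 1, sg = 1)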
# Define the word2vec model
model = gensim.models.Word2Vec(no_stop,
                               min_count = 2,
                               size = 12,
                               workers = 1,
                               window = 3,
                               sg = 0)

# Save the vocabulary
words = list(model.wv.vocab)
Some Interesting Insights
Comparing Similarities
"topics", "school") model.similarity(
0.016019769
“Pick the Odd One Out”
"sex","school","topics","discussed"]) model.doesnt_match([
'school'
“Most Similar To”
= ["sex","topics","teachers","school"], topn = 5) model.wv.most_similar(negative
[('pretty', 0.820723295211792),
('training', 0.7229318022727966),
('shirtless', 0.6807460188865662),
('actions', 0.6545418500900269),
('rules', 0.5551646947860718)]
= ["education","sex"], topn = 5) model.wv.most_similar(positive
[('disciplinary', 0.6693217754364014),
('covered', 0.5984655022621155),
('bad', 0.5979328155517578),
('policing', 0.589743971824646),
('seen', 0.5874642729759216)]
= ["school","curriculum"], topn = 5) model.wv.most_similar(negative
[('must', 0.7328099608421326),
('class', 0.6630020141601562),
('freely', 0.6044854521751404),
('tried', 0.573231041431427),
('biology', 0.5199291110038757)]
Education is to Sex what Discussed is to ___________
= ["discussed", "sex"],negative = ["education"])[0][0] model.wv.most_similar(positive
'open'
Words that are more than 50% similar to ‘sex’
[word for word in words if model.similarity("sex", word) > 0.5]
['sex',
'tried',
'bad',
'history',
'actions',
'part',
'sixth',
'mostly',
'minute',
'policing']
# Save the word2vec vocab
features = model[model.wv.vocab]
Visualization and Multidimensional Scaling with PCA
I picked PCA over TSNE (T-distributed stochastic neighbor embedding) because TSNE only forms meaningful clusters when there is enough data. While I do have a sizeable amount of data, it was not enough to form clusters when passed through the TSNE algorithm. Therefore, I decided to go ahead with PCA.
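For completeness, this is roughly what the TSNE attempt looked like (the perplexity value is an illustrative assumption, not a tuned choice); on this small vocabulary the resulting scatter did not show clear clusters:
# Hedged sketch of the TSNE projection that was tried and then set aside.
# perplexity = 15 is an arbitrary illustrative value.
tsne_out = TSNE(n_components = 2, perplexity = 15, random_state = 5).fit_transform(features)
tsne_df = pd.DataFrame(tsne_out, index = list(model.wv.vocab), columns = ['x', 'y'])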
Principal Component Analysis
I have used Principal Component Analysis (PCA), which is a form of feature extraction for reducing the dimensionality of data. It works by separating points as far as possible. It tells us three things:
- Direction in which our data is dispersed
- How each variable is associated with one another
- Relative importance of different directions.
The code below plots our Word2Vec word model on a 2D graph. Although not shown here, the points are plotted in relation to a “best fit line”. PCA finds the line that both maximizes the variance and minimizes the distance error (the difference between the best-fit line and each data point). I will be interpreting the data on the basis of (2), which is to analyze clusters of words on the graph. The distance between words indicates which words are closest together in 2D space.
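Point (3), the relative importance of the different directions, can be read directly from the fitted PCA's explained variance ratio; here is a small sketch, separate from the plotting code below:
# Sketch: how much of the word-vector variance the first two components capture.
variance_check = PCA(n_components = 2).fit(features)
print(variance_check.explained_variance_ratio_)        # share of variance per component
print(variance_check.explained_variance_ratio_.sum())  # total variance captured by the 2D plot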
# Define parameters of our PCA
# Just look at the first two dimensions - the X and Y axes
for_pca = PCA(n_components = 2)
pca_out = for_pca.fit_transform(features)

plt.rcParams.update({'font.size': 6})
plt.rcParams.update({'figure.figsize':(20,10)})

vocab = list(model.wv.vocab)
X = model[vocab]

for_pca = PCA(n_components = 2, random_state = 5)
pca_out = for_pca.fit_transform(features)

df = pd.DataFrame(pca_out, index=vocab, columns=['x', 'y'])

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(df['x'], df['y'], s=6, color = 'deeppink')
for word, pos in df.iterrows():
    ax.annotate(word, pos)
#plt.savefig('PCA Word2vec Plots.png', dpi = 240, bbox_inches='tight')
RESULTS
DISCUSSION
Limitations of Data
SIZE OF DATASET The biggest problem that I had was with Word2Vec. Every time I ran my model, it would give me different outputs, which made a standard analysis near impossible. I tried to modify the parameters in order to run the model in a deterministic manner (a single worker, a fixed seed, a custom hash function etc.), but nothing worked. According to the FAQs on the RaRe Technologies GitHub, Word2Vec behaves this way when the data used to train the model is too small. Given this, I thought it would be best to include some outputs that I had received while working on my project and analyze them.
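For reference, the determinism attempts described above looked roughly like the sketch below (the helper name and hash function are mine, and Python's hash randomization generally also has to be disabled before the interpreter starts); even with these settings, runs on this small corpus did not stabilize:
# Hedged sketch of the (unsuccessful) attempt to make Word2Vec deterministic:
# single worker, fixed seed, and a deterministic hash function (gensim 3.x API).
def fixed_hash(token):
    # deterministic replacement for Python's randomized string hashing
    return sum(ord(ch) for ch in token)

det_model = gensim.models.Word2Vec(no_stop, min_count = 2, size = 12, window = 3,
                                   workers = 1, seed = 5, hashfxn = fixed_hash, sg = 0)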
REPRESENTATIVENESS OF DATA To talk a little more about the assumptions I have made for this topic, it is important to keep in mind that the population sampled in this survey is not absolutely representative of the Indian population OR education system. The demographics of the survey participants show that they are primarily from well-developed metropolitan areas (Mumbai). India is extremely diverse as a nation, and there are many pockets even within otherwise uniform states where the differences (both ideological and tangible) between the rural and the urban are massive. Additionally, the female-dominated gender distribution in this data set could have affected many parts of the responses, such as the ratings of importance of the topics to be included in a school curriculum as well as the free response data. I believe females would naturally have rated topics such as Sexual Harassment in the Workplace higher than others, since these are directly relevant to their gender identity. Similarly, a large proportion of the textual data could have been dominated by female-driven experiences.
Discussion
An important insight for me was the merit of keeping the data analysis at the back of your mind while designing surveys or interviews, in order to get the best-formatted data and likely save time when you begin to work on it. When I started this survey, I had not thought through what I hoped to achieve by the end of it, and thus received data that required manual clean-up and standardization before I could begin to process it.
The Perceptual data indicates a strong trend of the youth perceiving lack of information in essential topics related directly or indirectly to sexual education, particularly in the frequency with which they are covered in school. The topics extracted from the LDA Model serve to show the discontent with the current educational curriculum regarding these topics. It is not possible to ignore the fact that youth aren’t getting the sexual education that they want in India, and policymakers can no longer hide under the excuse that the students don’t want to learn it. I hope my project serves to highlight that not providing accurate, frequent information to the youth on topics that they believe are important in a school curriculum will lead to them turning to less reliable sources such as the internet and entertainment media in order to fill that information gap. This is detrimental to their development into responsible, openminded and well-adjusted adults.
While the survey may not have been completely representative of the target population i.e. the youth of India, it does satisfy its aim of gathering data for a preliminary survey on youth perception. In my opinion, the Demographic and Perceptual Survey Data (the first sections of the notebooks) should be given the most importance due to their objectivity and forced-choice answer design. The textual analysis of the free response should be used as a supplementary resource to back up the results obtained from the first two sections. This would be the best course of action since the Textual Analysis provides us with a “Big Picture” view of the entire dataset along with interpretations of word frequency and similarity and the Demographic/Perceptual Data gives us a metric to weigh the fissure that exists between what the youth want to know and think is important, and their perception of how much of it they are getting in their school environments.
Future Direction
The next step would be to present the project and data to Educational Reform institutions in India or Educational Technology companies to make them aware of this knowledge gap that seems to exist within the youth community in India. I believe that they would have the additional resources and expertise that I could use to carry out more surveys and eventually use the data to curate a State (or Central Government) mandated sexual education curriculum for the Indian Certificate of Secondary Education (ICSE) Board that is dominated by content the youth is interested in learning about. This curriculum will require faculty to undergo training and participate in workshops to sensitize them to issues that they are out of touch with, before they are allowed to teach them in schools.
***
WORKS CITED
Muzzall, Evan. Notebook Week 5. Link.
Dave, Pranay. “PCA vs TSNE - El Clásico.” Medium, Towards Data Science, 30 May 2020, Link.
Tran, Khuyen. “How to Solve Analogies with Word2Vec.” Medium, Towards Data Science, 29 Mar. 2020, Link.
Kulshrestha, Ria. “NLP 101: Word2Vec - Skip-Gram and CBOW.” Medium, Towards Data Science, 19 June 2020, Link.
Amipara, Kevin. “Better Visualization of Pie Charts by MatPlotLib.” Medium, 20 Nov. 2019, Link.
Li, Zhi. “A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model.” Medium, Towards Data Science, 1 June 2019, Link.
Scott, William. “TF-IDF for Document Ranking from Scratch in Python on Real World Dataset.” Medium, Towards Data Science, 21 May 2019, Link.
Brems, Matt. “A One-Stop Shop for Principal Component Analysis.” Medium, Towards Data Science, 10 June 2019, Link.
jeffd23. “Visualizing Word Vectors with t-SNE.” Kaggle, 18 Mar. 2017, Link.
alvas, et al. “Ensure the Gensim Generate the Same Word2Vec Model for Different Runs on the Same Data.” Stack Overflow, Link.
Gaudard, Olivier. “#11 Grouped Barplot.” The Python Graph Gallery, 7 Sept. 2017, Link.
Gonzales, Raul, et al. “Cant Plot Seaborn Graphs Side by Side.” Stack Overflow, Link.
Dreams and Niels Joaquin. “Visualise word2vec Generated from Gensim.” Stack Overflow, Link.
oceandye and CathyQian. “How Do I Save Word Cloud as .Png in Python?” Stack Overflow, Link.
Franke, Robert, et al. “Save a Subplot in Matplotlib.” Stack Overflow, Link.
Zagorax, et al. “What Is the Operation behind the Word Analogy in Word2vec?” Stack Overflow, Link.
alvas. “Interpreting Negative Word2Vec Similarity from Gensim.” Stack Overflow, Link.
Shakti. “How to Run Tsne on word2vec Created from Gensim?” Stack Overflow, Link.